Abstract. Retrieving resources in a distributed environment is more
difficult than finding data in centralised databases. In the last decade
P2P systems have arisen as new and effective distributed architectures for
resource sharing, but searching in such environments can be difficult and
time-consuming. In this paper we discuss the efficiency of resource
discovery in PROSA, a self-organising P2P system heavily inspired by social
networks. All routing choices in PROSA are made locally, looking only
at the relevance of the next peer to each query. We show that PROSA
is able to effectively answer queries for rare documents, forwarding them
through the most convenient path to the nodes that most probably share
matching resources. This result is closely related to the small-world
structure that naturally emerges in PROSA.



1 Introduction 

The organisation of electronic resources and documents is of the utmost
importance for efficient searching and retrieval. Nowadays the WWW is a
(negative) example of how searching and obtaining information from an
unstructured knowledge base can become difficult and frustrating. In the case
of the World Wide Web, this problem is faced and partially solved by
centralised search engines, such as Google, MSN-Search, Yahoo and so on, which
help users prune away useless resources during searches. But the searching
strategies used by web indexing engines cannot be easily adopted in a P2P
environment, mainly because the nodes of such a distributed system cannot be
compared to web servers. Each peer shares a small amount of resources, can
join and leave the network many times in a week and usually searches for and
retrieves resources belonging to a small number of different topics. In the
last few years many P2P structures have been proposed in order to build a
valuable and efficient distributed environment for resource sharing.

The problem is that existing P2P systems usually ask the user to choose
between efficiency and usability. In fact, while DHT systems allow fast
resource searching [3] [12] [19] at the cost of introducing unnatural indexing
models, unstructured and weakly structured P2P systems [5] [20] [2] usually
allow users to easily express queries but perform poorly with respect to
bandwidth and time consumption.

In this work we analyse the retrieval performance of PROSA (P2P Resource
Organisation by Social Acquaintances), a P2P system heavily inspired by
social networks: joining, searching for resources and building links among
peers in PROSA are performed in a social way. Each peer gains a certain
number of strong links to peers which share similar resources and also
maintains weak links to far away peers.

The linking phase is similar to a birth: each peer is given just a couple of
weak links which can be used for query forwarding. Queries for resources are
forwarded through outgoing links to other peers, in accordance with a defined
"similarity" between the query and shared resources. New relationships in real
social networks arise because people have similar interests, culture and
knowledge. In a similar way, new links among peers in PROSA are established
when a query is forwarded and successfully answered, so that peers which share
similar resources eventually get connected together.

In this paper we focus on the ability of PROSA to answer queries with a
sufficient number of results, even if only a small number of existing
documents match them. Matching documents are retrieved efficiently, forwarding
queries to a small number of nodes using just a few "right" links, thanks to
the small-world structure that naturally emerges in a PROSA network.

In section 2 we give a brief formal description of the algorithms involved;
section 3 reports simulation results, focused on the retrieval of rare
resources; in section 4 the efficiency of the query routing algorithm is
discussed, while section 5 proposes guidelines for future work.

2 PROSA: a brief description 

As stated above, PROSA is a P2P network based on social relationships. More 
formally, we can model PROSA as a directed graph: 

$$\mathrm{PROSA} = (P, L, P_r, \mathit{Label}) \tag{1}$$

$P$ denotes the set of peers (i.e. vertices) and $L$ is the set of links
$l = (s, t)$ (i.e. edges), where $t$ is a neighbour of $s$. For a link
$l = (s, t)$, $s$ is the source peer and $t$ is the target peer. All links are
directed.

In P2P networks the knowledge of a peer is represented by the resources it
shares with other peers. In PROSA the mapping $P_r : P \to 2^R$ associates
peers with resources: for a given peer $s \in P$, $P_r(s)$ is the set of
resources hosted by peer $s$. Given a set of resources, we define a function
$R_c : 2^R \to C$ that provides a compact description of all resources. We
also define a function $P_k : P \to C$ such that, for a given peer $s$,
$P_k(s)$ is a compact description of the peer knowledge (PK, Peer Knowledge).
It can also be obtained by combining $P_r$ and $R_c$: $P_k(s) = R_c(P_r(s))$.
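
As an illustration only, the composition of the resource mapping and the
compact description can be sketched in Python. The paper does not fix a
concrete compact description function; here it is assumed to be the
L2-normalised sum of sparse term-weight vectors, and the names
`compact_description`, `peer_knowledge` and `hosted` are hypothetical.

```python
import math


def compact_description(resources):
    """Assumed R_c: sum the sparse term -> weight vectors, then L2-normalise."""
    total = {}
    for vector in resources:
        for term, weight in vector.items():
            total[term] = total.get(term, 0.0) + weight
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def peer_knowledge(hosted, s):
    """P_k(s) = R_c(P_r(s)); `hosted` plays the role of the mapping P_r."""
    return compact_description(hosted[s])
```

A peer with no resources gets an empty description, so any similarity computed
against it is zero.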

Relationships among people in real social networks are usually based on
similarities in interests, culture, hobbies, knowledge and so on [7][4][1].
Usually these kinds of links evolve from simple "acquaintance-links" to what
we call "semantic-links". To implement this behaviour three types of links
have been introduced: Acquaintance-Link (AL), Temporary Semantic-Link (TSL)
and Full Semantic-Link (FSL). TSLs represent relationships based on a partial
knowledge of a peer. They are usually stronger than ALs and weaker than FSLs.

In PROSA, if a given link is a simple AL, then the source peer does not
know anything about the target peer. If the link is a FSL, the source peer is
aware of the kind of knowledge owned by the target peer (i.e. it knows
$P_k(t)$, where $t \in P$ is the target peer). Finally, if the link is a TSL,
the source peer does not know the full $P_k(t)$ of the linked peer; it instead
has a Temporary Peer Knowledge ($TP_k$) which is based on queries previously
received from that peer. The different meanings of links are modelled by means
of a labelling function $Label$: for a given link $l = (s, t) \in L$,
$Label(l)$ is a vector of two elements $[e, w]$: the former is the link label
and the latter is a weight used to model what the source peer knows about the
target peer. The weight is computed as follows:

- if $e = AL \Rightarrow w = \emptyset$

- if $e = TSL \Rightarrow w = TP_k$

- if $e = FSL \Rightarrow w = P_k(t)$

In the next two sections, we give a brief description of how PROSA works. 
A detailed description of PROSA can be found in [6]. 



2.1 Peer Joining PROSA 

The case of a node that wants to join an existing network is similar to the
birth of a child. At the beginning of his life a child "knows" just a couple
of people (his parents). A new peer which wants to join simply looks for $n$
peers at random and establishes ALs to them. These links are ALs because a new
peer does not know anything about its neighbours until it asks them for
resources. This behaviour is quite easy to understand: when a baby comes to
life he does not know anything about his parents and relatives. The PROSA peer
joining procedure is described by algorithm 1.



Algorithm 1 JOIN: Peer s joining PROSA(P, L, P_r, Label)

Require: PROSA(P, L, P_r, Label), Peer s
1: RP ← rnd(P, n) {Randomly selects n peers of PROSA}
2: P ← P ∪ {s} {Adds s to the set of peers}
3: L ← L ∪ {(s, t), ∀t ∈ RP} {Links s with the randomly selected peers}
4: ∀t ∈ RP ⇒ Label(s, t) ← [AL, ∅] {Sets the added links as AL}
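
The joining step can be sketched in a few lines of Python. This is a minimal
sketch under stated assumptions, not the simulator's code: peers are hashable,
sortable identifiers, `labels` is a hypothetical dictionary standing in for
the labelling function, and `None` stands in for the empty weight of an AL.

```python
import random


def join(peers, links, labels, s, n, rng=random):
    """Add peer s and give it ALs to n randomly chosen existing peers."""
    # Select before inserting s, so that s can never pick itself.
    chosen = rng.sample(sorted(peers), min(n, len(peers)))
    peers.add(s)
    for t in chosen:
        links.add((s, t))              # directed link s -> t
        labels[(s, t)] = ("AL", None)  # an AL carries no knowledge about t
    return chosen
```

Passing a seeded `random.Random` instance as `rng` makes a simulated run
reproducible.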



2.2 PROSA dynamics



In order to show how PROSA works, we need to define the structure of a
query message. Each query message is a quadruple:

$$Q_M = (qid, q, s, n_r) \tag{2}$$

where $qid$ is a unique query identifier which ensures that a peer does not
respond to a query more than once; $q$ is the query, expressed according to
the knowledge model in use¹; $s \in P$ is the source peer and $n_r$ is the
number of required results. The dynamic behaviour of PROSA is modelled by
algorithm 2 and is strictly related to queries. When a user of PROSA asks for
a resource on a peer $s$, the inquired peer $s$ builds up a query $q$ and
specifies the number of results $n_r$ it wants to obtain. This is equivalent
to calling $ExecQuery(PROSA, s, (qid, q, s, n_r))$.



Algorithm 2 ExecQuery: query q originating from peer s executed on peer cur

Require: PROSA(P, L, P_r, Label), cur ∈ P, q ∈ QM
 1: Result ← ∅
 2: if cur ≠ s then
 3:   UpdateLink(PROSA, cur, s, q)
 4: end if
 5: (Result, numRes) ← ResourcesRelevance(PROSA, q, cur, n_r)
 6: if numRes = 0 then
 7:   f ← SelectNextPeer(PROSA, cur, q)
 8:   if f ≠ null then
 9:     ExecQuery(PROSA, f, qm)
10:   end if
11: else
12:   SendMessage(s, cur, Result)
13:   L ← L ∪ {(s, cur)}
14:   Label(s, cur) ← [FSL, P_k(cur)]
15:   if numRes < n_r then
16:     {- Semantic Flooding -}
17:     for all t ∈ Neighborhood(cur) do
18:       rel ← PeerRelevance(P_k(t), q)
19:       if rel > Threshold then
20:         qm ← (qid, q, s, n_r - numRes)
21:         ExecQuery(PROSA, t, qm)
22:       end if
23:     end for
24:   end if
25: end if



¹ If knowledge is modelled by the Vector Space Model, for example, q is a
state vector of stemmed terms. If knowledge is modelled by ontologies, q is an
ontological query, and so on.



The first time ExecQuery is called, cur is equal to s and this avoids the
execution of instruction 3. Subsequent calls of ExecQuery, i.e. when a peer
receives a query forwarded by another peer, use the function UpdateLink, which
updates the link between the current peer cur and the forwarding peer prev, if
necessary. If the requesting peer is an unknown peer, a new TSL to that peer
is added, having as weight a Temporary Peer Knowledge ($TP_k$) based on the
received query message. Note that a $TP_k$ can be considered a "good hint" for
the current peer, in order to gain links to other remote peers. It is highly
probable that the query will finally be answered by some other peer and that
the requesting peer will eventually download some of the resources that
matched it. It is therefore useful to record a link to that peer, in case that
kind of resource is requested in the future by other peers. If the requesting
peer is a TSL for the peer that receives the query, the corresponding $TP_k$
is updated. If the requesting peer is a FSL, no updates are necessary.
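
The three cases of the link update can be sketched as follows. This is a
hedged sketch, not the authors' implementation: `labels` is a hypothetical
dictionary from (source, target) pairs to (kind, weight) pairs, queries are
sparse term-weight dictionaries, the Temporary Peer Knowledge is assumed to
be updated by summing query weights, and an existing AL is assumed to be
treated like an unknown peer (the text does not cover that case).

```python
def update_link(labels, cur, requester, query):
    """Update the link cur -> requester after cur receives `query`."""
    key = (cur, requester)
    kind, weight = labels.get(key, (None, None))
    if kind is None or kind == "AL":
        # Unknown peer (or, by assumption, a bare acquaintance):
        # create a TSL whose TP_k is seeded with the received query.
        labels[key] = ("TSL", dict(query))
    elif kind == "TSL":
        # Known TSL: fold the new query into the temporary knowledge.
        merged = dict(weight)
        for term, w in query.items():
            merged[term] = merged.get(term, 0.0) + w
        labels[key] = ("TSL", merged)
    # FSL: the full peer knowledge is already known, nothing to update.
    return labels[key]
```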

The relevance of a query with respect to the resources hosted by a peer is
evaluated by calling the function ResourcesRelevance. Two cases are possible:

- If none of the hosted resources has a sufficient relevance, the query has to
  be forwarded to another peer f, called the "forwarder". This peer is
  selected among the neighbours of the current peer by SelectForwarder (named
  SelectNextPeer in algorithm 2), using the following procedure:

  - The current peer computes the relevance between the query q and the
    weight of each link connecting it to its neighbourhood.

  - It selects the link with the highest relevance, if any, and forwards the
    query message along it.

  - If the peer has neither FSLs nor TSLs, i.e. it has just ALs, the query
    message is forwarded along one link chosen at random.

  This procedure is described in algorithm 2, where subsequent forwards are
  performed by means of recursive calls to ExecQuery.

- If the peer hosts resources with sufficient relevance with respect to q, two
  sub-cases are possible:

  - The peer has enough relevant documents to fulfil the request. In this
    case a result message is sent to the requesting peer and the query is
    not forwarded any further.

  - The peer has a certain number of relevant documents, but they are not
    enough to fulfil the request (i.e. they are fewer than $n_r$). In this
    case a response message is sent to the requesting peer, specifying the
    number of matching documents. The query message is forwarded along all
    the links in the neighbourhood whose relevance to the query is higher
    than a given threshold (semantic flooding). The number of matched
    resources is subtracted from the total number of requested documents
    before each forward step.

When the requesting peer receives a response message it builds a new FSL to
the answering peer and then presents the results to the user. If the user
decides to download a certain resource from another peer, the requesting peer
directly contacts the peer owning that resource and asks for the download. If
the download is accepted, the resource is sent to the requesting peer.
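
The query dynamics described in this section can be put together in a small
Python sketch. It is a sketch under several assumptions rather than the
simulator itself: documents and queries are sparse term-weight dictionaries,
relevance is cosine similarity with a single shared threshold, the peer
knowledge is the plain sum of the hosted vectors, UpdateLink is reduced to
creating a TSL for unknown requesters, and peers that already handled a query
id are skipped when choosing a forwarder. All function and parameter names are
hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knowledge(docs, p):
    """Assumed P_k(p): the sum of the document vectors hosted by p."""
    pk = {}
    for d in docs[p]:
        for t, w in d.items():
            pk[t] = pk.get(t, 0.0) + w
    return pk


def select_next_peer(labels, cur, q, visited, rng):
    """Best-ranked TSL/FSL neighbour; a random neighbour if only ALs exist."""
    out = [(t, lab) for (p, t), lab in labels.items()
           if p == cur and t not in visited]
    ranked = [(cosine(q, w), t) for t, (kind, w) in out if kind != "AL"]
    if ranked:
        return max(ranked)[1]
    return rng.choice([t for t, _ in out]) if out else None


def exec_query(docs, labels, cur, qm, seen, answers, threshold=0.3, rng=random):
    qid, q, s, n_r = qm
    if (qid, cur) in seen:            # each peer handles a query id only once
        return
    seen.add((qid, cur))
    if cur != s:                      # simplified UpdateLink: TSL seeded by q
        labels.setdefault((cur, s), ("TSL", dict(q)))
    result = [d for d in docs[cur] if cosine(q, d) > threshold][:n_r]
    if not result:                    # nothing relevant here: forward once
        visited = {p for (i, p) in seen if i == qid}
        f = select_next_peer(labels, cur, q, visited, rng)
        if f is not None:
            exec_query(docs, labels, f, qm, seen, answers, threshold, rng)
        return
    answers.append((cur, result))     # stands in for SendMessage(s, cur, Result)
    if cur != s:                      # no self-link for the source (assumption)
        labels[(s, cur)] = ("FSL", knowledge(docs, cur))
    if len(result) < n_r:             # semantic flooding
        for t in [t for (p, t) in list(labels) if p == cur]:
            if cosine(knowledge(docs, t), q) > threshold:
                qm_next = (qid, q, s, n_r - len(result))
                exec_query(docs, labels, t, qm_next, seen, answers, threshold, rng)
```

The `seen` set plays the role of the query identifier check, and `answers`
collects what the source peer would receive as response messages.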



3 Information Retrieval in PROSA



Other studies about PROSA [6] [18] revealed that it naturally evolves to a 
small-world network, with a really high clustering coefficient and a relatively 
small average path length between peers. 

The main aim of this work is to show that PROSA not only has desirable
topological properties, but also that resource searching can be greatly
improved by exploiting those characteristics. The fact that all peers in PROSA
are connected by a small number of hops does not guarantee anything about
searching efficiency. In this section we show that searching for resources in
PROSA is fast and successful, mainly because peers that share resources on the
same topic usually turn out to be strongly connected with similar peers.

3.1 Two words about simulations 

In order to show that PROSA can be used to efficiently share resources in
a P2P environment, we developed an event-driven functional simulator written
in Python. The knowledge base used for simulations is composed of scientific
articles in the fields of mathematics and philosophy. Articles about
mathematics come from "Journal of the American Mathematical Society" [15],
"Transactions of the American Mathematical Society" [17] and "Proceedings of
the American Mathematical Society" [16], for a total of 740 articles. Articles
in the field of philosophy come from "Journal of Social Philosophy" [9],
"Journal of Political Philosophy" [8], "Philosophical Issues" [10] and
"Philosophical Perspectives" [11], for a total of 750 articles.

The simulator uses a Vector Space [13] knowledge model for resources. Each
document is represented by a state vector which contains the 100 highest
TF-IDF [14] weights of the terms contained in the document.
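
As a rough illustration of this document model, the sketch below builds
truncated TF-IDF vectors in Python. The exact weighting used by the simulator
is not stated in the paper; the relative term frequency times the logarithm of
the inverse document frequency is assumed here, tokens are assumed to be
already stemmed, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(corpus, top_k=100):
    """corpus: list of token lists. Returns one sparse vector per document,
    keeping only the top_k highest TF-IDF weights."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        weights = {t: (c / len(doc)) * math.log(n_docs / df[t])
                   for t, c in tf.items()}
        # Sort by decreasing weight; ties are broken alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(dict(top))
    return vectors
```

Terms that occur in every document get weight zero under this variant, so they
are the first to be dropped by the truncation.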

Each peer contains, on average, 20 ± 5 articles on the same topic. Nodes
perform 80% of their queries on the topic of the hosted resources and the
remaining 20% on the other topic. We chose to do so after some studies about
query distribution in a Gnutella P2P system [5] and with real social
communities in mind, where most requests for resources are focused on a very
small number of topics.
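
The workload described above could be generated along the following lines.
This is only a sketch: the paper does not say how the 20 ± 5 articles are
distributed, so a uniform integer between 15 and 25 is assumed, and the names
`make_peer` and `pick_query_topic` are hypothetical.

```python
import random


def make_peer(topic, articles_by_topic, rng):
    """Give a peer 15..25 distinct articles of one topic
    (an assumed reading of "20 ± 5")."""
    count = rng.randint(15, 25)
    return rng.sample(articles_by_topic[topic], count)


def pick_query_topic(own_topic, other_topic, rng, p_own=0.8):
    """Choose the topic of the next query: 80% own topic, 20% the other."""
    return own_topic if rng.random() < p_own else other_topic
```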

3.2 Number of retrieved documents 

One of the most relevant quality measures of a resource searching algorithm is
the number of documents retrieved by each query. In this paragraph we examine
the results obtained with PROSA, using the query mechanism described in
section 2. We also compare PROSA to other searching strategies, such as random
walk and flooding.

Figure 1(a) shows a comparison of the average number of retrieved documents
in a PROSA network for different numbers of nodes, when each node performs
15 queries on average.



As shown in figure 1(a), the best performance is obtained by flooding, since
the average number of retrieved documents per query is about 10, which is the
number of documents required by each query ($n_r$)². Nevertheless, PROSA is
able to retrieve about 4 documents per query, on average, and this result is
still better than that obtained with a random walk, which usually retrieves
only 2.8 documents per query.

This suggests that the query routing algorithm, based on local link ranking,
is efficient and usually lets queries "flow" in the direction of nodes that
can probably answer them. We note that PROSA is able to retrieve a relatively
high number of documents even when compared with simple flooding. This is a
good result, since flooding is known to be the optimal searching strategy:
queries are actually forwarded to all nodes, so all existing matching
documents are retrieved, until the required number of documents has been
obtained.

In figure 1(b) the average number of retrieved documents per successful
query is reported. The best performance is once again obtained by flooding,
while PROSA retrieves an average of 4.2 documents for each successful query,
out of the 10 documents required. Random walk has, once again, the worst
performance.

Looking only at the number of retrieved documents could be misleading: it is
not important to have a small number of queries answered with a high number
of documents. It is desirable to have almost all feasible queries³ answered by
a sufficient number of documents. Figure 1(c) shows the percentage of
retrieved documents for PROSA, flooding and random walk, on the same PROSA
network with different network sizes. Note that in every case the average
amount of unfeasible queries is around 6%.

The highest percentage of answered queries is obtained by flooding the
network, since about 94% of queries have an answer. This means that
practically all the queries are answered, except those that have no matching
documents. A valuable result is also obtained by PROSA: 84% to 92% of all
queries are answered, while random walk usually returns results for less than
80% of issued queries⁴. The percentage of answered queries increases with
network size, for all searching strategies, because all nodes have an average
of 20 documents: more nodes means more documents, i.e. a higher probability of
finding matching documents.

² A query is no longer forwarded once a sufficient number of documents has
been retrieved, as explained in section 2.

³ A query is feasible if there exist matching documents to answer it.
Otherwise it is considered unfeasible.

⁴ If a query eventually enters an unconnected component, it cannot be
forwarded any further.

3.3 Query recall

Even though it is an important parameter for a resource searching and
retrieval strategy, the number of retrieved documents is not the best measure
of how many documents a searching algorithm is able to retrieve. Since not all
queries match the same number of documents, it is better to measure the
percentage of retrieved documents over all matching documents. A valuable
measure is the so-called "recall", i.e. the percentage of distinct retrieved
documents over the total number of distinct existing documents that match a
query. In figure 2(a) we show the recall distribution for PROSA, flooding and
random walk when each node performs 15 queries on average.
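
The recall of a single query, and the share of queries reaching a given recall
level, can be computed as in the following minimal Python sketch. Document
identifiers are assumed to be hashable, unfeasible queries are assumed to be
excluded from the distribution, and both function names are hypothetical.

```python
def recall(retrieved, matching):
    """Fraction of distinct matching documents that were retrieved."""
    matching = set(matching)
    if not matching:
        return None               # unfeasible query: recall is undefined
    return len(set(retrieved) & matching) / len(matching)


def share_with_recall_at_least(recalls, level):
    """Share of feasible queries whose recall is >= level."""
    feasible = [r for r in recalls if r is not None]
    if not feasible:
        return 0.0
    return sum(r >= level for r in feasible) / len(feasible)
```

Duplicates in the retrieved list do not inflate the measure, since only
distinct documents are counted.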

The best performance is obtained, once again, by flooding the network: about
60% of queries have a recall of 100%, and about 80% of queries have a recall
of at least 50%. Searching by flooding may fail to return all documents
because PROSA is a directed graph, and unconnected components could still
exist. PROSA also has high recall: about 20% of queries obtain all matching
documents, while 45% of queries are answered with at least one half of the
total number of matching documents. Random walk is the worst case: about 80%
of queries have a recall of less than 50% and only 8% of queries obtain all
matching documents.

Recall measured as the simple percentage of retrieved documents over the
total number of matching documents does not take into account the fact that
in PROSA queries are requested to retrieve $n_r$ documents and no more. This
fact could in practice influence the recall measure for PROSA networks, since
queries are no longer forwarded once a sufficient number of documents has been
retrieved. On the other hand, it is important to analyse the recall in the
case of "rare" queries. Note that we consider a query to be "rare" when the
total number of matching documents is lower than the number of requested
documents; similarly a query is considered "common" if it matches more than
$n_r$ documents⁵.

Figure 2(b) shows the cumulative normalised distribution of recall for rare 
queries, while figure 2(c) reports the cumulative distribution for common queries. 

The results reported in figure 2(b) are interesting: PROSA answers 35% of
rare queries by retrieving all matching documents, while 75% of queries
retrieve at least 50% of the total number of matching documents; less than 10%
of queries obtain less than 30% of matching documents. The performance of a
random walk is worse than that obtained by PROSA: only 20% of queries obtain
all matching documents, while more than 30% of them obtain less than 30% of
matching results.

The situation is slightly different for common queries. As reported in figure
2(c), PROSA is able to retrieve at least 10 documents for 20% of issued
queries and, in every case, at least one document is found for 99% of queries,
and at least 3 documents for 85% of queries. We think that this behaviour is
also affected by the chosen value of $n_r$.

⁵ Reported results are relative to $n_r = 10$.

In order to better understand the benefits of using PROSA, it is interesting
to look also at other measures that could clarify some PROSA characteristics.
For instance, recall results are of little relevance without a measure of how
fast answers are obtained. A feasible measure of speed is the average
query deepness, defined as the average number of "levels" a query is forwarded
away from the source node.

In figure 2(d) we show the average deepness of successful queries for PROSA,
flooding and random walk on the same PROSA network for different numbers
of peers.

Query deepness for PROSA is around 3 and is not heavily affected by the
network size, while that of flooding and random walk is much higher (from 30
to 60 and from 120 to 600, respectively). The better results obtained by PROSA
cannot be explained simply by the network clustering coefficient, since all
simulations are performed on the same network. We suppose that they are mainly
due to the searching algorithm implemented by PROSA itself: it is able to find
a convenient and efficient route to forward queries along, avoiding a large
number of forwards to non-relevant nodes.

4 Energy Considerations

An important parameter to take into account in order to quantify the
efficiency of a searching strategy is the "energy" needed to forward and
answer each query. In a theoretical model it is probably of no great
importance how much power is needed in order to answer a query, but for real
systems this is a crucial parameter. One of the main issues with unstructured
P2P networks such as Gnutella [5] is that queries waste a lot of bandwidth,
since a large fraction of the network is flooded and a great number of nodes
are involved in answering each query. It is possible to roughly define the
average "energy" required for each query using equation 3, where $N_q$ is the
number of nodes to which the query has been forwarded, $L_q$ is the number of
links used during query routing, and $b$ and $c$ are dimensional scaling
factors.

$$E_q = b \cdot L_q + c \cdot N_q \tag{3}$$

The definition given here for query energy is quite simple: it takes into
account the required bandwidth, represented by the term $b \cdot L_q$, and the
computational power needed by nodes in order to process queries, represented
by the term $c \cdot N_q$.

To estimate the amount of energy required to answer queries, we can look
at the average number of nodes and the average number of links involved in
each query. Note that $N_q$ and $L_q$ are usually different, since a node can
be reached through many paths: even if it processes the query only once⁶, the
bandwidth wasted to forward the query to it cannot be saved.

⁶ Requests with the same query id are ignored.
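
A minimal Python sketch of the energy estimate follows. It assumes that a
query trace is available as a list of (sender, receiver) forwards, which is
not something the paper specifies; `query_energy` and `hops` are hypothetical
names, and the default scaling factors are placeholders.

```python
def query_energy(hops, b=1.0, c=1.0):
    """E_q = b * L_q + c * N_q for one query.

    hops: list of (from_peer, to_peer) forwards performed for the query.
    """
    l_q = len(hops)                    # every forward consumes bandwidth
    n_q = len({t for _, t in hops})    # each reached node processes it once
    return b * l_q + c * n_q
```

For example, a trace in which one node is reached through two different paths
counts two links but only one processing node for it.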



Figure 3(a) and 3(b) show, respectively, the average number of nodes involved 
and the average number of links used by successful queries, both for PROSA 
and a simple random walk search. 



Fig. 3. 

Since random walk uses a higher number of nodes and a higher number of links
in order to answer the same queries, it is clear that PROSA requires less
energy. Moreover, since PROSA is able to retrieve more matching documents
than a random walk (as shown in section 3.2), we can state that PROSA is
efficient with respect to the average "energy" required to answer queries.

5 Conclusions and Future Work 

This work presented a formal description of PROSA, a self-organising system
for P2P resource sharing heavily inspired by social networks. Simulations show
that resource searching and retrieval in PROSA is efficient, because of
the ability of peers to make good local choices that result in fast and
successful global query routing. Interesting results are obtained for query
recall measured on rare documents: PROSA is able to route queries for those
documents directly to nodes that can probably answer them successfully. Since
PROSA turns out to be a small-world network, all nodes are reached in a few
steps, avoiding waste of bandwidth and processing power. Future work includes
further study of PROSA in order to discover emerging structures, such as
semantic groups and communities of similar peers.